Web Mining


One company's devious plan to stop AI web scrapers from stealing your content

Mashable

AI is stealing your content. We know this is how AI companies have built their highly-valued businesses – by scraping the web and using your data to train their chatbots. In the past, websites could rely on simple protocols like robots.txt to define what could, and could not, be used by web crawlers. Those guidelines were respected by the companies doing the scraping to, say, build results for search engines. AI companies, however, are not abiding by this social contract and are ignoring those instructions.


The Synergy of Automated Pipelines with Prompt Engineering and Generative AI in Web Crawling

arXiv.org Artificial Intelligence

Web crawling is a critical technique for extracting online data, yet it poses challenges due to webpage diversity and anti-scraping mechanisms. This study investigates the integration of the generative AI tools Claude AI (Sonnet 3.5) and ChatGPT-4.0 with prompt engineering to automate web scraping. Using two prompts, PROMPT I (general inference, tested on Yahoo News) and PROMPT II (element-specific, tested on Coupons.com), we evaluate the code quality and performance of AI-generated scripts. Claude AI consistently outperformed ChatGPT-4.0 in script quality and adaptability, as confirmed by predefined evaluation metrics, including functionality, readability, modularity, and robustness. Performance data were collected through manual testing and structured scoring by three evaluators. Visualizations further illustrate Claude AI's superiority. Anti-scraping solutions, including undetected_chromedriver, Selenium, and fake_useragent, were incorporated to enhance performance. This paper demonstrates how generative AI combined with prompt engineering can simplify and improve web scraping workflows.
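
To make the anti-scraping setup concrete, here is a minimal sketch of the kind of script such prompts produce, combining undetected_chromedriver, fake_useragent, and a Selenium selector. The target URL and the "h3 a" headline selector are illustrative assumptions, not selectors taken from the paper.

```python
# Minimal sketch of a scraping script using the anti-scraping tools named above.
# The URL and the "h3 a" selector are illustrative assumptions, not from the paper.
import undetected_chromedriver as uc
from fake_useragent import UserAgent
from selenium.webdriver.common.by import By

def scrape_headlines(url: str) -> list[str]:
    options = uc.ChromeOptions()
    options.add_argument(f"--user-agent={UserAgent().random}")  # rotate the user agent
    options.add_argument("--headless=new")
    driver = uc.Chrome(options=options)  # patched ChromeDriver to reduce bot detection
    try:
        driver.get(url)
        # Collect visible headline text from anchors nested inside h3 elements.
        return [el.text for el in driver.find_elements(By.CSS_SELECTOR, "h3 a") if el.text]
    finally:
        driver.quit()

if __name__ == "__main__":
    for headline in scrape_headlines("https://news.yahoo.com"):
        print(headline)
```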


XPath Agent: An Efficient XPath Programming Agent Based on LLM for Web Crawler

arXiv.org Artificial Intelligence

We present XPath Agent, a production-ready XPath programming agent specifically designed for web crawling and web GUI testing. A key feature of XPath Agent is its ability to automatically generate XPath queries from a set of sampled web pages using a single natural language query. To demonstrate its effectiveness, we benchmark XPath Agent against a state-of-the-art XPath programming agent across a range of web crawling tasks. Our results show that XPath Agent achieves comparable performance metrics while significantly reducing token usage and improving clock-time efficiency. The well-designed two-stage pipeline allows for seamless integration into existing web crawling or web GUI testing workflows, thereby saving time and effort in manual XPath query development. The source code for XPath Agent is available at https://github.com/eavae/feilian.
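
Since the agent's output is ultimately an XPath query, a small illustration of applying such a query with lxml shows how it would slot into a crawler. The HTML snippet and the query below are hypothetical stand-ins, not XPath Agent's own output or code.

```python
# Applying a generated XPath query with lxml. The HTML and the query are
# hypothetical examples of the kind of output an XPath-writing agent produces.
from lxml import html

page = html.fromstring("""
<ul class="news">
  <li><a href="/a1">First story</a></li>
  <li><a href="/a2">Second story</a></li>
</ul>
""")

# A query like the one an agent might emit for "extract the story links":
generated_xpath = '//ul[@class="news"]/li/a/@href'
print(page.xpath(generated_xpath))  # ['/a1', '/a2']
```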


Condé Nast has reportedly accused AI search startup Perplexity of plagiarism

Engadget

Condé Nast, the media conglomerate that owns publications such as The New Yorker, Vogue and Wired, has sent a cease-and-desist letter to AI-powered search startup Perplexity, according to The Information. The letter, which was sent on Monday, demands that Perplexity stop using content from Condé Nast publications in its AI-generated responses and accuses the startup of plagiarism. The move makes Condé Nast the latest in a growing list of publishers taking a stand against the unauthorized use of their content by AI companies, and comes a month after similar action taken by Forbes. Perplexity and Condé Nast did not immediately respond to a request for comment from Engadget. A recent investigation from Wired revealed that the startup's web crawlers do not respect robots.txt.


Cleaner Pretraining Corpus Curation with Neural Web Scraping

arXiv.org Artificial Intelligence

The web contains large-scale, diverse, and abundant information to satisfy the information-seeking needs of humans. Through meticulous data collection, preprocessing, and curation, webpages can be used as a fundamental data resource for language model pretraining. However, as webpages become increasingly complex and varied in structure, rule-based and feature-based web scrapers are becoming inadequate. This paper presents a simple, fast, and effective Neural web Scraper (NeuScraper) to help extract primary, clean text content from webpages. Experimental results show that NeuScraper surpasses the baseline scrapers by more than 20%, demonstrating its potential for extracting higher-quality data to facilitate language model pretraining. All of the code is available at https://github.com/OpenMatch/NeuScraper.
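
For context, the rule-based baselines the paper argues against typically look something like the sketch below: strip boilerplate tags and keep paragraph text. This is an illustrative example of a rule-based extractor, not code from the NeuScraper repository.

```python
# A rule-based extraction baseline of the kind the paper contrasts with a
# neural scraper: drop boilerplate tags, keep paragraph text. Illustrative only.
from bs4 import BeautifulSoup

def rule_based_extract(html_doc: str) -> str:
    soup = BeautifulSoup(html_doc, "html.parser")
    # Remove tags that rarely contain primary content.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)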


A Japanese-Chinese Parallel Corpus Using Crowdsourcing for Web Mining

arXiv.org Artificial Intelligence

Using crowdsourcing, we collected more than 10,000 URL pairs (parallel top page pairs) of bilingual websites that contain parallel documents and created a Japanese-Chinese parallel corpus of 4.6M sentence pairs from these websites. We used a Japanese-Chinese bilingual dictionary of 160K word pairs for document and sentence alignment. We then used 1.2M high-quality Japanese-Chinese sentence pairs to train a parallel corpus filter based on statistical language models and word translation probabilities. We compared the translation accuracy of the model trained on these 4.6M sentence pairs with that of the model trained on Japanese-Chinese sentence pairs from CCMatrix (12.4M), a parallel corpus from global web mining. Although our corpus is only one-third the size of CCMatrix, we found that the accuracy of the two models was comparable and confirmed that it is feasible to use crowdsourcing for web mining of parallel data.
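
The toy sketch below illustrates only the dictionary-coverage idea behind such a filter: score a candidate sentence pair by how many source words have a dictionary translation on the target side. The authors' actual filter also uses statistical language models and learned word translation probabilities; the tiny dictionary and threshold here are made-up examples.

```python
# Toy dictionary-coverage filter for candidate Japanese-Chinese sentence pairs.
# Illustrative only; the dictionary and threshold are made-up examples.
def coverage_score(ja_tokens: list[str], zh_tokens: list[str],
                   dictionary: dict[str, set[str]]) -> float:
    """Fraction of Japanese tokens whose dictionary translation appears on the Chinese side."""
    if not ja_tokens:
        return 0.0
    zh_set = set(zh_tokens)
    hits = sum(1 for t in ja_tokens if dictionary.get(t, set()) & zh_set)
    return hits / len(ja_tokens)

dictionary = {"猫": {"猫"}, "犬": {"狗"}, "本": {"书"}}
pair = (["猫", "は", "本", "を", "読む"], ["猫", "读", "书"])
keep = coverage_score(*pair, dictionary) >= 0.3  # threshold is illustrative
print(keep)  # True: 2 of 5 Japanese tokens are covered
```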


Web crawler strategies for web pages under robot.txt restriction

arXiv.org Artificial Intelligence

Today, nearly everyone knows the World Wide Web and works over the Internet daily. In this paper, we introduce how search engines work on the keywords entered by users to find something. A search engine uses different search algorithms to provide convenient results to the net surfer. Net surfers go with the top search results, but how did those web pages earn higher ranks on search engines, and how did the search engine get all of those web pages into its database? This paper answers these basic questions. It also addresses web crawlers working for search engines and the robots exclusion protocol rules that govern them, and describes the restriction directives webmasters place in the robots.txt file to instruct web crawlers, along with some basic robots.txt formats.
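
A minimal example of the kind of robots.txt directives described above, checked with Python's standard urllib.robotparser; the rules and URLs shown are illustrative, not taken from any real site.

```python
# Checking illustrative robots.txt rules with Python's standard library.
# The directives below are example values, not taken from any real site.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: BadBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/private/page.html"))  # False
print(rp.can_fetch("*", "https://example.com/public/page.html"))   # True
print(rp.can_fetch("BadBot", "https://example.com/anything"))      # False
```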


New York Times, CNN and Australia's ABC block OpenAI's GPTBot web crawler from accessing content

The Guardian > Technology

News outlets including the New York Times, CNN, Reuters and the Australian Broadcasting Corporation (ABC) have blocked a tool from OpenAI, limiting the company's ability to continue accessing their content. OpenAI is behind one of the best known artificial intelligence chatbots, ChatGPT. Its web crawler – known as GPTBot – may scan webpages to help improve its AI models. The Verge was first to report the New York Times had blocked GPTBot on its website. The Guardian subsequently found that other major news websites, including CNN, Reuters, the Chicago Tribune, the ABC and Australian Community Media (ACM) brands such as the Canberra Times and the Newcastle Herald, appear to have also disallowed the web crawler.


Should we trust web-scraped data?

arXiv.org Artificial Intelligence

The increasing adoption of econometric and machine-learning approaches by empirical researchers has led to widespread use of one data collection method: web scraping. Web scraping refers to the use of automated computer programs to access websites and download their content. The key argument of this paper is that naïve web scraping procedures can lead to sampling bias in the collected data. This article describes three sources of sampling bias in web-scraped data. More specifically, sampling bias emerges from web content being volatile (i.e., subject to change), personalized (i.e., presented in response to request characteristics), and unindexed (i.e., the absence of a population register). In a series of examples, I illustrate the prevalence and magnitude of sampling bias. To support researchers and reviewers, this paper provides recommendations on anticipating, detecting, and overcoming sampling bias in web-scraped data.
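
One crude way to probe the "volatile content" source of bias is simply to re-fetch a page after a delay and compare content hashes. The sketch below is a toy illustration under that assumption, not a procedure from the paper; the URL and interval are hypothetical.

```python
# Toy probe for volatile web content: fetch the same page twice and compare
# content hashes. The URL and delay are hypothetical; illustrative only.
import hashlib
import time
import requests

def content_hash(url: str) -> str:
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()

url = "https://example.com/listings"  # hypothetical scrape target
first = content_hash(url)
time.sleep(60)  # re-fetch after a delay
second = content_hash(url)
print("content changed between fetches:", first != second)
```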


Comparative analysis of various web crawler algorithms

arXiv.org Artificial Intelligence

This presentation focuses on the importance of web crawling and page ranking algorithms in dealing with the massive amount of data present on the World Wide Web. As the web continues to grow exponentially, efficient search and retrieval methods become crucial. Web crawling is a process that converts unstructured data into structured data, enabling effective information retrieval. Additionally, page ranking algorithms play a significant role in assessing the quality and popularity of web pages. The presentation explores the background of these algorithms and evaluates five different crawling algorithms: Shark Search, Priority-Based Queue, Naive Bayes, Breadth-First, and Depth-First. The goal is to identify the most effective algorithm for crawling web pages. By understanding these algorithms, we can enhance our ability to navigate the web and extract valuable information efficiently.
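
To make the simplest of the five strategies concrete, here is a minimal breadth-first crawler sketch restricted to a single domain. It is an illustration of the general technique, not code from the presentation; the seed URL, page limit, and same-domain rule are assumed choices.

```python
# Minimal breadth-first crawler sketch, one of the five strategies compared
# above. Seed URL, page limit, and same-domain rule are illustrative choices.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed: str, max_pages: int = 20) -> list[str]:
    domain = urlparse(seed).netloc
    queue, seen, visited = deque([seed]), {seed}, []
    while queue and len(visited) < max_pages:
        url = queue.popleft()                      # FIFO queue gives breadth-first order
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        visited.append(url)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)               # enqueue newly discovered same-domain pages
    return visited

if __name__ == "__main__":
    print(bfs_crawl("https://example.com"))
```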